Algorithm 5 Optimizing 1-bit CNNs with Bayesian Learning
Input: The full-precision kernels k, the reconstruction vector w, the learning rate η, regularization parameters λ, θ and variance ν, and the training dataset.
Output: The BONN with the updated k, w, μ, σ, $c_m$, $\sigma_m$.
1: Initialize k and w randomly, and then estimate μ, σ based on the average and variance of k, respectively;
2: repeat
3:    // Forward propagation
4:    for l = 1 to L do
5:       $\hat{k}^l_i = w^l \circ \mathrm{sign}(k^l_i)$, ∀i;  // Each element of $w^l$ is replaced by the average of all elements of $w^l$.
6:       Perform activation binarization;  // Using the sign function
7:       Perform 2D convolution with $\hat{k}^l_i$, ∀i;
8:    end for
9:    // Backward propagation
10:   Compute $\delta_{\hat{k}^l_i} = \partial L_S / \partial \hat{k}^l_i$, ∀l, i;
11:   for l = L to 1 do
12:      Calculate $\delta_{k^l_i}$, $\delta_{w^l}$, $\delta_{\mu^l_i}$, $\delta_{\sigma^l_i}$;  // using Eqs. 3.112–3.119
13:      Update parameters $k^l_i$, $w^l$, $\mu^l_i$, $\sigma^l_i$ using SGD;
14:   end for
15:   Update $c_m$, $\sigma_m$;
16: until convergence
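To make steps 5–7 of Algorithm 5 concrete, the following minimal PyTorch sketch collapses the reconstruction vector to its layer-wise mean, binarizes kernels and activations with the sign function, and runs a standard 2D convolution. The SignSTE helper, the bonn_forward name, and all tensor shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F


class SignSTE(torch.autograd.Function):
    """sign(x) in the forward pass; straight-through gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient only where |x| <= 1 (a common straight-through choice).
        return grad_output * (x.abs() <= 1).float()


def bonn_forward(k, w, o):
    """k: full-precision kernels (C_out, C_in, kH, kW); w: reconstruction vector of the layer;
    o: input feature map (N, C_in, H, W)."""
    w_bar = w.mean()                           # step 5: every element of w^l is replaced by its mean
    k_hat = w_bar * SignSTE.apply(k)           # k_hat^l_i = w_bar^l * sign(k^l_i)
    o_hat = SignSTE.apply(o)                   # step 6: binarize the activations
    return F.conv2d(o_hat, k_hat, padding=1)   # step 7: 2D convolution with the binarized operands


k = torch.randn(16, 8, 3, 3, requires_grad=True)
w = torch.rand(16, requires_grad=True)         # assumed length of w^l
o = torch.randn(4, 8, 32, 32)
print(bonn_forward(k, w, o).shape)             # torch.Size([4, 16, 32, 32])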
where w denotes a learned vector used to reconstruct the full-precision kernels and is shared within a layer. As mentioned in Section 3.2, during forward propagation, $w^l$ becomes a scalar $\bar{w}^l$ in each layer, where $\bar{w}^l$ is the mean of $w^l$ and is calculated online. The convolution process is represented as
$$O^{l+1} = \big((\bar{w}^l)^{-1} \hat{K}^l\big) * \hat{O}^l = (\bar{w}^l)^{-1} \big(\hat{K}^l * \hat{O}^l\big), \quad (3.111)$$
where $\hat{O}^l$ denotes the binarized feature map of the l-th layer, and $O^{l+1}$ is the feature map of the (l+1)-th layer. As Eq. 3.111 depicts, the actual convolution is still binary, and $O^{l+1}$ is obtained by simply multiplying $(\bar{w}^l)^{-1}$ with the result of the binary convolution. For each layer, only one floating-point multiplication is added, which is negligible for BONNs.
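Because $(\bar{w}^l)^{-1}$ is a single scalar during forward propagation, it can be pulled out of the convolution, which is exactly what Eq. 3.111 states. The short check below (a sketch with randomly generated binary tensors, not library code) confirms the equivalence numerically:

import torch
import torch.nn.functional as F

w_bar = torch.rand(1) + 0.1                    # scalar stand-in for the layer-wise mean of w^l
k_hat = torch.sign(torch.randn(16, 8, 3, 3))   # binarized kernels
o_hat = torch.sign(torch.randn(4, 8, 32, 32))  # binarized feature map

# Left-hand side of Eq. 3.111: scale the binarized kernels, then convolve.
lhs = F.conv2d(o_hat, (1.0 / w_bar) * k_hat, padding=1)
# Right-hand side: convolve with binary operands, then apply one scalar multiplication.
rhs = (1.0 / w_bar) * F.conv2d(o_hat, k_hat, padding=1)

print(torch.allclose(lhs, rhs, atol=1e-4))     # True: only one extra floating-point multiply per layer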
In addition, we consider the Gaussian distribution in the forward process of Bayesian pruning, which updates every filter in a group based on the mean of that group. Specifically, during pruning we replace each filter by $K^l_{i,j} \leftarrow (1-\gamma)K^l_{i,j} + \gamma \bar{K}^l_j$, where $\bar{K}^l_j$ denotes the mean filter of the group.
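A minimal sketch of this group-mean update during pruning, assuming a simple fixed assignment of output filters to groups and an illustrative value of γ (the helper name group_mean_update and the grouping scheme are ours, not part of the method):

import torch


def group_mean_update(kernels, group_ids, gamma=0.5):
    """kernels: (C_out, C_in, kH, kW); group_ids: length-C_out tensor of group indices."""
    updated = kernels.clone()
    for g in torch.unique(group_ids):
        idx = (group_ids == g).nonzero(as_tuple=True)[0]
        group_mean = kernels[idx].mean(dim=0, keepdim=True)     # mean filter of group g
        # K_{i,j} <- (1 - gamma) * K_{i,j} + gamma * mean of group j
        updated[idx] = (1.0 - gamma) * kernels[idx] + gamma * group_mean
    return updated


kernels = torch.randn(8, 4, 3, 3)
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])              # two groups of four filters
pruned = group_mean_update(kernels, group_ids, gamma=0.5)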
3.7.6 Asynchronous Backward Propagation
To minimize Eq. 3.108, we update $k^{l,i}_n$, $w^l$, $\mu^l_i$, $\sigma^l_i$, $c_m$, and $\sigma_m$ using stochastic gradient descent (SGD) in an asynchronous manner, which updates the vector $w$ instead of its mean $\bar{w}$, as elaborated below.
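The asynchrony can be seen in a few lines: the forward pass consumes only the scalar mean of $w^l$, yet the SGD step updates every element of the vector $w^l$. A toy illustration under a placeholder loss (the shapes and the loss itself are assumptions made purely for this sketch):

import torch

w = torch.rand(8, requires_grad=True)             # reconstruction vector w^l (assumed length)
k = torch.randn(16, 8, 3, 3, requires_grad=True)  # full-precision kernels
optimizer = torch.optim.SGD([k, w], lr=0.01)

# Forward propagation uses only the scalar mean of w (Algorithm 5, step 5) ...
k_hat = w.mean() * torch.sign(k).detach()         # detach: sign has zero gradient almost everywhere
loss = k_hat.pow(2).mean()                        # placeholder for the true objective of Eq. 3.108
loss.backward()

# ... but the update step modifies the whole vector w, not its mean:
optimizer.step()
print(w.grad.shape)                               # torch.Size([8]) -- every entry of w receives a gradient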
Updating $k^{l,i}_n$: We define $\delta_{k^{l,i}_n}$ as the gradient of the full-precision kernel $k^{l,i}_n$, and we have
$$\delta_{k^{l,i}_n} = \frac{\partial L}{\partial k^{l,i}_n} = \frac{\partial L_S}{\partial k^{l,i}_n} + \frac{\partial L_B}{\partial k^{l,i}_n}. \quad (3.112)$$
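Eq. 3.112 simply splits the kernel gradient into the contribution of $L_S$ and that of $L_B$. The sketch below verifies the decomposition with autograd, using a placeholder task loss and a quadratic stand-in for the Bayesian term; the true forms follow from Eq. 3.108 and the subsequent equations, not from this example.

import torch

k = torch.randn(16, 8, 3, 3, requires_grad=True)
mu = torch.zeros_like(k)                    # stand-in for the prior mean
lam = 1e-3                                  # stand-in weighting

L_S = k.sigmoid().mean()                    # placeholder task loss
L_B = lam * (k - mu).pow(2).sum()           # placeholder Bayesian regularizer

grad_S = torch.autograd.grad(L_S, k)[0]     # dL_S / dk
grad_B = torch.autograd.grad(L_B, k)[0]     # dL_B / dk
delta_k = grad_S + grad_B                   # Eq. 3.112: the two contributions simply add

# Differentiating the total loss directly gives the same gradient.
k2 = k.detach().clone().requires_grad_(True)
total = k2.sigmoid().mean() + lam * (k2 - mu).pow(2).sum()
total.backward()
print(torch.allclose(delta_k, k2.grad))     # True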